Introduction

We are interested in recovering an unobservable signal $x_0$ based on observations

$$y(t_i) = Tx_0(t_i) + \varepsilon_i, \qquad (1)$$

where $T : X \to Y$ is a known compact linear operator between Hilbert spaces $X$, $Y$. We cannot observe our target function $x_0$ directly, but rather through a blurred (by the linear filter $T$), noise-corrupted sample of $y$ over a collection of discrete observation points $t_i$, $i = 1, \dots, n$. Throughout the paper, we shall denote $y = (y(t_i))_{i=1}^n$.

We assume that the observations $y(t_i) \in \mathbb{R}$ and that the observation noise $\varepsilon_i$ are i.i.d. realizations of a certain random variable $\varepsilon$.

We will say the problem is ill-posed if the generalized inverse operator $T^+ : Y \to X$ defined by $(T^*T)^{-1}T^*$, where $T^*$ stands for the transpose operator, is unbounded. In this case, this inverse operator needs to be, in some sense, regularized.

Regularization methods replace an ill-posed problem by a family of well-posed problems. Their solutions, called regularized solutions, are used as approximations of the desired solution of the inverse problem. These methods always involve some parameter measuring the closeness of the regularized and the original (unregularized) inverse problem. Rules (and algorithms) for the choice of these regularization parameters, as well as convergence properties of the regularized solutions, are central points in the theory of these methods.

The statistical problem has been extensively studied, although in general the efficient choice of the regularization parameter is still under active research. Two main kinds of estimators have been considered in the literature: first, regularized estimators such as Tikhonov-type estimators, and second, nonlinear thresholded estimators. The first approach has been studied in great detail. Interesting early surveys of this topic are provided in [20, 19]. In this setting, the main issues are what kind of regularization functional should be considered and, closely related, what the relative weight of the selected regularization functional should be. More recently, Mair and Ruymgaart in [18] or Bissantz et al. in [3] studied different regularized inverse problems and proved the optimality of the rate of convergence for their estimators assuming the regularity of the target function $x_0$ to be known. Special attention has been devoted in this setting to the singular value decomposition (SVD) of the operator $T$. We cite the recent work in this direction in [7, 6]. The second approach has its most popular version in the wavelet-vaguelette decomposition introduced in [9]. In this case the main issue is finding an appropriate basis over which $T^+$, the generalized inverse, is almost diagonal. This idea is further developed in [12], which introduces mirror wavelets. Closely related, Cohen et al. in [8] construct an adaptive thresholded estimator based on Galerkin's method. We also cite the recent work in [14], which combines an SVD approach with a thresholding technique over a certain new basis.

Reconstructing the unknown target function $x_0$ is related to four issues: the filter $T$, the probabilistic structure of the noise, the fact that we only observe $T(x_0)$ over a finite observation scheme, and finally the regularity of $x_0$. In order to unify notation, our assumptions will be presented in terms of an underlying basis of $Y$, $\{\psi_j\}_{j \ge 1}$, and the increasing sequence of approximating spaces $Y_m := \langle \psi_j \rangle_{j \le d_m}$. The ill-posedness of $T$ will then be defined in terms of these subspaces $Y_m$ and the discrete observation scheme. As is usual in the numerical literature, the regularity of $x_0$ will be defined in terms of the operator $T$ (for a detailed discussion see [11] or [13]).

Our goal in this article is to develop algorithms providing estimators of $x_0$ that achieve optimal rates of convergence within the regularization method when the smoothness of the true solution is not known a priori. For this, we focus on model selection techniques for regularization procedures. As in the work of [5], [15] or [16], we choose the best regularization scheme among a set of regularization operators by minimizing a contrast with a well-chosen penalty. Since the choice of a penalty is crucial, we provide a very general way of calibrating the penalty with respect to the regularization operators. This enables us to build optimal estimators, within the chosen regularization method, for inverse problems for two general classes of estimators, Tikhonov estimators and projection estimators.

The article is divided into four main parts. In Section 1 we describe the 
general framework and provide several assumptions. In Section 2 we build a 
general penalty which leads to an oracle inequality for regularization operators 
and shows optimality of the adaptive procedures. In Section 3 we apply the above 
results to Tikhonov regularization operators and projection operators. Section 
4 is devoted to technical lemmas providing deviation bounds which are used in 
order to obtain non asymptotic bounds for the quadratic risk of our estimator. 
1. Model description and general assumptions

In this section we introduce our general notation and assumptions. 

We want to estimate a function $x_0 : \mathbb{R} \to \mathbb{R}$ based on the blurred and noisy observations

$$y(t_i) = Tx_0(t_i) + \varepsilon_i, \qquad i = 1, \dots, n.$$

It is important to stress that the observations depend on a fixed design $(t_1, \dots, t_n) \in \mathbb{R}^n$. This will require introducing an empirical norm based on this design. Let $Q_n$ be the empirical measure of the covariables, $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{t_i}$, where $\delta$ is the Dirac function. The $L_2(Q_n)$-norm of a function $y \in Y$ is then given by

$$\|y\|_n = \left(\int y^2 \, dQ_n\right)^{1/2},$$

and the empirical scalar product by $\langle y, \varepsilon \rangle_n = \frac{1}{n}\sum_{i=1}^n \varepsilon_i y(t_i)$. Remark that this empirical norm is defined over the observation space $Y$. Over the solution space $X$ we will consider the norm given by the Hilbert space structure. For the sake of simplicity, we will write $\|\cdot\|_X = \|\cdot\|$ when no confusion is possible. Over a finite dimensional space, the norm $\|\cdot\|$ will always stand for the Euclidean norm and if $v \in \mathbb{R}^d$, $v^t$ will stand for the transpose vector. Likewise for any matrix $A \in \mathbb{R}^{d \times r}$, $A^t$ will stand for the transpose matrix and $A^+ := (A^tA)^{-1}A^t$ for the generalized inverse. Considered as an operator, we will write $A^*$ for the adjoint of the corresponding operator.
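As a minimal sketch of the notation above, the empirical scalar product and the empirical norm can be computed directly from the values of the functions on the design. The function names and the equispaced design below are illustrative assumptions, not part of the original text.

```python
import math

def empirical_inner(y_vals, e_vals):
    """<y, e>_n = (1/n) * sum_i e_i * y(t_i) on a fixed design."""
    if len(y_vals) != len(e_vals):
        raise ValueError("both samples must live on the same design")
    n = len(y_vals)
    return sum(e * y for y, e in zip(y_vals, e_vals)) / n

def empirical_norm(y_vals):
    """||y||_n = sqrt((1/n) * sum_i y(t_i)^2), the L2(Q_n) norm."""
    return math.sqrt(empirical_inner(y_vals, y_vals))

# Hypothetical design: n equispaced points in [0, 1).
n = 8
design = [i / n for i in range(n)]
y_vals = [math.cos(2 * math.pi * t) for t in design]
norm_sq = empirical_norm(y_vals) ** 2
```

Only the values of a function at the design points enter these quantities, which is why the empirical norm is defined over the observation space and not over the solution space.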

We also introduce certain standard assumptions on the observation noise.

AN moment condition for the errors

$\varepsilon$ is a centered random variable satisfying the moment condition $\mathbb{E}(|\varepsilon|^q/\sigma^q) \le q!/2$ for all $q > 1$, with $\mathbb{E}(\varepsilon^2) = \sigma^2$.

As usual in statistics, we assume that $x_0$ satisfies a certain smoothness condition. In this paper, we assume the following source assumption, typically encountered in the inverse problems literature.

SC source condition

There exists $\nu > 0$ such that $x_0 \in \mathrm{Range}((T^*T)^\nu) := \mathcal{R}((T^*T)^\nu)$.

Moreover consider

$$A_{\nu,\rho} = \{x \in X,\ x = (T^*T)^\nu w,\ \|w\| \le \rho\},$$

where $0 \le \nu \le \nu_0$, $\nu_0 > 0$, and use the further notation

$$A_\nu = \bigcup_{\rho > 0} A_{\nu,\rho} = \mathcal{R}((T^*T)^\nu). \qquad (2)$$

These sets are usually called source sets; $x \in A_\nu$ is said to have a source representation.



J.-M. Loubes and C. Ludena/ Adaptive complexity regularization 664 

Remark 1.1. Such a condition is usual in the analysis of inverse problems (see [11]). It links the decay of the eigenvalues of the operator with the decay of the coefficients of the decomposition of the function in the SVD basis. Thus, the parameter $\nu$ can be seen as a smoothness parameter providing a restriction over the regularity of the function to be recovered. Indeed, let $(\lambda_j, \phi_j, \varphi_j)_{j \ge 1}$ be the singular value decomposition of the operator $T$, in the sense that the following decomposition holds

$$T^*Tx = \sum_{j=1}^{\infty} \lambda_j^2 \langle x, \phi_j \rangle \phi_j.$$

Hence, $x_0 \in A_\nu$ if and only if

$$\sum_{j=1}^{\infty} \frac{|\langle x_0, \phi_j \rangle|^2}{\lambda_j^{4\nu}} < \infty.$$

The parameter $\nu$ is not known a priori, hence adaptation results will be provided with respect to this smoothness parameter.

Estimating over all $X$ is in general not possible because we can only observe $Tx_0$ over the fixed design $(t_1, \dots, t_n)$. Thus we assume that we are equipped with a sequence of nested linear subspaces whose union is dense in $Y$, $Y_1 \subset Y_2 \subset \dots \subset Y_m \subset \dots \subset Y$. We assume $\dim(Y_m) = d_m$. We are interested in a subcollection of these spaces generated by a set of indices $\mathcal{M}_n$. In this paper, we will use these approximation spaces as projection spaces in order to study the data. So, denote the projection of any space $W$ over any subspace $Z$ by $\Pi_Z W$. Let $\Pi^n_{Y_m}$ stand for the projection in the empirical norm. Set also the corresponding projected operator $T_m = \Pi^n_{Y_m}T$.

Using a sieve of the space $Y$, we consider the corresponding approximation spaces in the space $X$, defined as $X_m = T_m^*Y$. By construction

$$\Pi_{X_m} = (\Pi^n_{Y_m}T)^+ \Pi^n_{Y_m}T.$$

We point out that both $T_m$ and its adjoint operator $T_m^*$ depend on the observation sequence $t_i$. However, we will usually drop this fact from the notation. To illustrate this assertion, consider the following example.

Example 1.1. If $Y_m$ is generated by some orthonormal basis $\phi = (\phi_1, \dots, \phi_{d_m})$, with respect to the $L_2$ norm over $Y$, and $T = Id$, then

$$\Pi^n_{Y_m}y = \sum_{j=1}^{d_m} y_{j,n}\phi_j,$$

where $y_{j,n} = \langle \Pi^n_{Y_m}y, \phi_j \rangle_n$ are the solution to the projection problem under the empirical measure $Q_n$. Set $G_m = (\phi_j(t_i))_{i,j}$, $i = 1, \dots, n$ and $j = 1, \dots, d_m$. Thus, we may write in matrix notation

$$\Pi^n_{Y_m}y = (G_m^tG_m)^{-1}G_m^t(y(t_1), \dots, y(t_n))^t.$$
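The matrix formula of the example can be sketched numerically as follows. This is only an illustration under stated assumptions: the two basis functions, the design, and the helper names `solve` and `empirical_projection` are hypothetical, and the normal equations are used with $G$ stored as an $n \times d_m$ array.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def empirical_projection(basis, design, y_vals):
    """Coefficients (G^t G)^{-1} G^t y with G[i][j] = basis[j](design[i])."""
    g = [[phi(t) for phi in basis] for t in design]
    d = len(basis)
    gram = [[sum(row[j] * row[k] for row in g) for k in range(d)] for j in range(d)]
    rhs = [sum(row[j] * y for row, y in zip(g, y_vals)) for j in range(d)]
    return solve(gram, rhs)

# Hypothetical basis (constant and linear function) and design on [0, 1].
basis = [lambda t: 1.0, lambda t: t]
design = [0.0, 0.25, 0.5, 0.75, 1.0]
y_vals = [2.0 + 3.0 * t for t in design]
coeffs = empirical_projection(basis, design, y_vals)
```

Because the sampled function already lies in the span of the basis, the computed coefficients reproduce it; with noisy data the same call returns the least-squares fit on the design.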
An important issue is that, as above, we can always define $T_m$ in matrix notation and thus $T_m^ty$ always makes sense. Moreover, if $u \in Y$, we will use indistinctly $T_m^tu \in \mathbb{R}^{d_m}$ and $T_m^*u \in X_m$, the latter in operator notation.

Now, define $p$, the degree of ill-posedness of the forward operator $T$. In our case we will relate this parameter to the approximation properties of the spaces $Y_m$, i.e. to the projection operator $\Pi^n_{Y_m}$, as follows.

IP ill-posedness of the operator Let $\mathcal{M}_n$ be an index set. For $m \in \mathcal{M}_n$ there exists a parameter $p > 0$ such that

$$\gamma(m) := \|(I - \Pi^n_{Y_m})T\| = O(d_m^{-p}).$$

$p$ is called the index of ill-posedness of the forward operator. In the following, this parameter is supposed to be known.

To illustrate this condition we include the following example.

Example 1.2. The above assumption can be seen to hold under certain conditions over the operator $T$ and the matrix $G_m$ defined in Example 1.1. Let $(\lambda_j, \phi_j, \varphi_j)_j$ be the singular value decomposition of the forward operator $T$ and assume that there exists $p > 0$ such that $\lambda_j = O(j^{-p})$. In the statistical literature this parameter is often considered as an index of ill-posedness, since the difficulty of estimation in inverse problems usually comes from the difficulty of inverting the operator due to the decay of its eigenvalues. Let $Y_m$ be the linear subspace generated by $\{\varphi_j\}_{1 \le j \le d_m}$, and assume that the fixed observation design $t_i$, $i = 1, \dots, n$, is such that this basis is also orthogonal in the empirical norm. Assume also that $\sup_{j = 1, \dots, d_m} \|\varphi_j\|_\infty < \infty$. Then,

$$I - \Pi^n_{Y_m} = I - \Pi_{Y_m} - \Pi^n_{Y_m}(I - \Pi_{Y_m}),$$

where

$$\|(I - \Pi_{Y_m})T\| = O(d_m^{-p})$$

and

$$\|\Pi^n_{Y_m}(I - \Pi_{Y_m})u\| = \|\Pi^n_{Y_m}(I - \Pi_{Y_m})u\|_n \le \|(I - \Pi_{Y_m})u\|_n \le \sup_j \|\varphi_j\|_\infty \|(I - \Pi_{Y_m})u\|,$$

leading to the approximation property provided in [IP]. If the forward operator acts in a smoothing way, such as integrating $p$ times for example, the eigenvalues $\lambda_j$ satisfy the required decay rates (see [18]).



Define

$$\nu_m := \|T^+\Pi^n_{Y_m}\|.$$

This quantity controls the amplification of the observation error over the solution space $X_m$. Consider

$$\gamma_m := \inf_{v \in Y_m,\ \|v\| = 1} \|T_m^*v\|,$$

which expresses the effect of operator $T_m^*$ over the approximating subspace $Y_m$.
We have as in [13], v m > j m . 

On the other hand this term is related to the goodness of the approximation scheme. Following the proof in [13], it can be seen that

$$\gamma_{m+1} \le \|T^*(I - \Pi^n_{Y_m})\| = \|(I - \Pi^n_{Y_m})T\|.$$

The next assumption requires that $\gamma_m$ and $\gamma(m)$ are of the same order, which will be written $\gamma_m \sim \gamma(m)$.

AS amplification error 

We assume 

lm = 0(d m P). (3) 

Moreover assume there exists a positive constant U such that 

i — <Vu. 

lm 

In the estimation procedure, we will first project the data onto a large enough approximation space indexed by $m_0$, to be selected later in this paper. For this fixed $m_0$, let $(\lambda_j, \phi_j, \varphi_j)$, $j = 1, \dots, d_{m_0}$, be the singular value decomposition of the operator $T_{m_0}$. For any $u \in Y$ we can write $T^*_{m_0}u = \sum_{j=1}^{d_{m_0}} \lambda_j \phi_j \langle u, \varphi_j \rangle_n$, which depends only on $u = (u(x_1), \dots, u(x_n))^t$. As previously, recall that $G_{m_0} \in M_{d_{m_0}, n}$ is defined by $G = (\varphi_j(x_i))_{j,i}$, $i = 1, \dots, n$, $j = 1, \dots, d_{m_0}$. Thus, abusing notation we may write $T^t_{m_0} = D(G_{m_0}G^t_{m_0})^{-1}G_{m_0} : \mathbb{R}^n \to \mathbb{R}^{d_{m_0}}$, where $D = D(\lambda_j)_{j = 1, \dots, d_{m_0}}$ is the diagonal matrix with entries $\lambda_j$. Since $T^t_{m_0}u = T^*_{m_0}u$, the latter in operator notation, both interpretations will be used indistinctly. On the other hand, for $x \in X_{m_0}$, identified with a $d_{m_0}$-dimensional vector, we can think of $(T_{m_0}x(t_1), \dots, T_{m_0}x(t_n))^t = G^t_{m_0}Dx$. So that in matrix notation also

$$T^t_{m_0}T_{m_0} = D^2.$$

We also introduce the following assumptions.

SV There exist positive constants $k_1 < k_2$ such that $k_1 j^{-p} \le \lambda_j \le k_2 j^{-p}$.
SF Let $v_j$, $j = 1, \dots, d_{m_0}$, be the eigenvalues of the matrix $G^tG$; then there exist constants $a_1 < a_2$ such that $a_1 n \le v_j \le a_2 n$.

Remark 1.2.

• Assumption AS thus establishes that the worst amplification of the error over $X_m$ is roughly equivalent to the best approximation over $Y_m$ in the empirical norm. This yields a uniform bound on the operator $TT^+\Pi^n_{Y_m}$, so that regularization by $T$ compensates the bad condition of the approximation $T^+\Pi^n_{Y_m}$ to $T^+$ (see [11] or [13] for further comments). Notice that $T^+\Pi^n_{Y_m}T$ is just the projection operator over $X_m$, so it is a bounded operator.

• Assumption SV is slightly stronger than AS as it establishes the exact order of the $\gamma_m$. It is seen to hold, for instance, in Example 1.2. Assumption SF is necessary to ensure the convergence results further on. It also holds in Example 1.2.

• We point out that the assumptions are met as soon as the approximation spaces are constructed as in Example 1.2, i.e. by the so-called truncated SVD, where the spaces are defined as $Y_m = \mathrm{span}\{\varphi_j,\ j = 1, \dots, d_m\}$, where we recall that the $\varphi_j$ are eigenfunctions corresponding to the eigenvalues of $T^*T$ sorted in decreasing order. Since eigenfunctions are often hard to obtain explicitly, for practical purposes we are more interested in the usual piecewise polynomial projection bases or wavelet bases, as in [13].

2. Adaptive choice for regularized estimator 

Consider a class of regularized estimators built using a projection and a regularization procedure. Namely, let $Y_{m_0}$ be a big enough space in the sense that $m_0$ is such that

||(/-n Xm( >o||< mf W-Ux 

This quantity can be chosen so as not to depend on the unknown regularity of the solution $x_0$. Under assumption SC the above inequality is satisfied if the dimension of the set is such that

"mo — 

Thus it is enough to choose $m_0$ such that $d_{m_0} \ge n^{1/(2p+1)}$.

For $\mathcal{K}_n$ a set of indices, consider $\{\tilde R_k, k \in \mathcal{K}_n\}$ a collection of regularization operators which depend on different values of the smoothing parameters. For instance consider Tikhonov regularization operators, which rely on the choice of a smoothing sequence, Landweber iteration operators, which rely on the choice of a stopping index, or other general smoothing operators described in [10]. Consider the corresponding estimators

$$\hat x_k := \tilde R_k \Pi^n_{Y_{m_0}}y = R_ky, \qquad (4)$$

where we have written $R_k := \tilde R_k \Pi^n_{Y_{m_0}}$. The behavior of such general estimators depends on the choice of the regularization sequence. From the theory of inverse problems, we know that it is possible to choose a regularization operator for which the corresponding estimator achieves the optimal rate of convergence, but this choice depends on $\nu$ defined in SC, which characterizes the regularity of the solution.

Our aim is to build a method that picks, according to the data, an optimal $R_{\hat k}$ among all the $R_k$, $k \in \mathcal{K}_n$, in such a way that optimal rates are maintained. This choice must also not depend on a priori regularity assumptions. We point out that selecting the optimal smoothing parameter in a collection of sequences belongs to model selection theory, since it is equivalent to selecting a good model among a collection of sets.




For this consider the following penalized procedure. For a given constant $r > 2$ and weights $L_k$, $k \in \mathcal{K}_n$, to be chosen, define the penalty as

$$\mathrm{pen}(k) := r\sigma^2(1 + L_k)[\mathrm{Tr}(R_k^tR_k) + \rho^2(R_k)],$$

where $\mathrm{Tr}(R_k^tR_k)$ is the trace and $\rho(R_k^tR_k) = \rho^2(R_k)$ is the spectral radius. Finally $\hat k$ is selected as the solution of

$$\hat k := \arg\min_{k \in \mathcal{K}_n} \left\{ \|R_k(y - T(\hat x_k))\|^2 + \mathrm{pen}(k) \right\}, \qquad (5)$$

which defines the estimator $\hat x_{\hat k} = R_{\hat k}y$. Let $x_k = R_kTx_0$ be the regularized true function, which measures the accuracy of the estimation procedure without observation noise. The following result states the asymptotic behaviour of the estimator $\hat x_{\hat k}$.
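The penalty defined above can be sketched in a simplified setting. The sketch assumes that each $R_k$ is a diagonal matrix, so that the trace is the sum of the squared diagonal entries and the squared spectral radius is the largest of them; the Tikhonov-type entries, the default values of `r` and `weight`, and the function names are illustrative assumptions, and the factor coming from the design matrix $G$ is left out.

```python
def tikhonov_filters(singular_values, alpha):
    """Diagonal entries lambda_j / (lambda_j^2 + alpha) of a Tikhonov-type operator."""
    return [lam / (lam * lam + alpha) for lam in singular_values]

def penalty(filters, sigma2, r=2.5, weight=0.0):
    """pen(k) = r * sigma^2 * (1 + L_k) * [Tr(R_k^t R_k) + rho^2(R_k)] for diagonal R_k."""
    if r <= 2:
        raise ValueError("the text requires r > 2")
    trace = sum(f * f for f in filters)   # Tr(R^t R): sum of squared diagonal entries
    rho2 = max(f * f for f in filters)    # rho^2(R): largest squared diagonal entry
    return r * sigma2 * (1.0 + weight) * (trace + rho2)

# Hypothetical singular values; alpha = 0 means no regularization.
singular_values = [1.0, 0.5]
pen_unregularized = penalty(tikhonov_filters(singular_values, 0.0), sigma2=1.0)
pen_regularized = penalty(tikhonov_filters(singular_values, 1.0), sigma2=1.0)
```

In this toy setting a stronger regularization shrinks the diagonal entries and therefore the penalty, which reflects the smaller amplification of the noise by a smoother operator.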

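Theorem 2.1. Assume AN and SF are satisfied. There exists a constant $C$ which depends on $r$ and on $T$, such that the following inequality holds true

$$\mathbb{E}\|\hat x_{\hat k} - x_0\|^2 \le 2\|(I - \Pi_{X_{m_0}})x_0\|^2 + C \inf_{k \in \mathcal{K}_n} \left[\|x_k - x_0\|^2 + 2\,\mathrm{pen}(k)\right] + \frac{\Sigma(d)}{n}, \qquad (6)$$

where we have set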

S(d) = E 2 



fce/c„ 



dTr{R\R k 



P 2 (Rk) 



P 2 (nR k ) 



-l 



,-y/ 'dL k [Tr(RlR k )+^(R k )}/ p*(R k ) 



for $d$ as in Lemma 4.3.

Hence, the estimator is optimal in the sense that the adaptive estimator achieves the best rate of convergence among all the regularized estimators, up to an error of order $\mathrm{pen}(k)$ and $\Sigma(d)/n$. This bound is non-asymptotic and the rate of convergence depends on both previous terms.

We point out that the constant $r$ and the weights must be chosen in order to control respectively the penalization term and the constant $\Sigma(d)$. The weights are technically introduced to achieve the so-called Kraft inequality $\Sigma(d) < +\infty$ and hence to control the size of the set of parameters $\mathcal{K}_n$.

We also point out that under SF, $\rho^2(nR_k)$ and $\mathrm{Tr}(R_k^tR_k)/\rho^2(R_k)$ do not depend on $n$.

Proof. For any $x_k$ and any $k \in \mathcal{K}_n$, the mere definition of the estimator $\hat x_{\hat k}$ implies that

$$\|R_{\hat k}(y - T\hat x_{\hat k})\|^2 + \mathrm{pen}(\hat k) \le \|R_k(y - Tx_k)\|^2 + \mathrm{pen}(k)$$

and

$$\|R_k(y - Tx_k)\|^2 = \|R_kT(x_0 - x_k)\|^2 + 2\langle R_kT(x_0 - x_k), R_k\varepsilon \rangle + \|R_k\varepsilon\|^2.$$

Thus, following standard arguments we have

$$\|R_{\hat k}T(x_0 - \hat x_{\hat k})\|^2 \le \|R_kT(x_0 - x_k)\|^2 - 2\langle R_{\hat k}T(x_0 - \hat x_{\hat k}), R_{\hat k}\varepsilon \rangle + 2\langle R_kT(x_0 - x_k), R_k\varepsilon \rangle - \|R_{\hat k}\varepsilon\|^2 + \|R_k\varepsilon\|^2 + \mathrm{pen}(k) - \mathrm{pen}(\hat k).$$

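Let $0 < \kappa < 1$. Since $2ab \le \kappa a^2 + \frac{1}{\kappa}b^2$ for any $a$, $b$, we have for any $k$ and $x_k \in X_{m_0}$

$$(1 - \kappa)\|R_{\hat k}T(x_0 - \hat x_{\hat k})\|^2 \le (1 + \kappa)\|R_kT(x_0 - x_k)\|^2 + \left(2 + \frac{1}{\kappa}\right)\mathrm{pen}(k) + 2\sup_{k \in \mathcal{K}_n}\left\{\left(1 + \frac{1}{\kappa}\right)\|R_k\varepsilon\|^2 - \mathrm{pen}(k)\right\}.$$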

On the other hand, using that 1 ^ RkT Sj C, we have that for any x k G X mo 

and any k <G N, 



(1 - k)||.t - %|| 2 < C(l - k)||.t - x k \ 
2 + - ) pcn(/c) + 2 Ci sup { ||i? fc e|| 2 - ( pen(fc) 



feeK:,, 



K 



The proof then follows directly from Lemma 4.3, which characterizes the supremum of the empirical process under the linear application defined by the regularization family. The choice of $\kappa$ will depend on $r$ in the penalty. □

3. Applications to standard regularization operators 

Penalized estimators are widely used to solve linear inverse problems and can be written in the form (4). Indeed for $k \in \mathcal{K}_n$, consider a sequence of (diagonal) matrices $A_k \in M_{m_0 \times m_0}(\mathbb{R})$. Then, define for a chosen sequence $A_k$ the following penalized estimator

$$\hat x_k := \arg\min_{x \in X_{m_0}} \|\Pi^n_{Y_{m_0}}(y - Tx)\|^2 + \|A_kx\|^2. \qquad (7)$$

Note that, for each fixed $A_k$, the expression (7) can be written in the following way:

$$\hat x_k = (T^*_{m_0}T_{m_0} + A_k^tA_k)^{-1}T^*_{m_0}y. \qquad (8)$$

In practice the second expression is more complicated (the matrix to invert might be big), but it is simpler to deal with in order to show our results concerning the selection of $A_k$. With this notation set the smoothing operator

$$R_k := (T^t_{m_0}T_{m_0} + A_k^tA_kI_{m_0})^{-1}T^t_{m_0}.$$

We point out that choosing the smoothing sequence $A_k$ is the key point since it balances the two terms: if $\|A_k\|$ is large the solution will be smooth but will not, in general, comply with the observations. On the other hand, if $\|A_k\|$ is small, the solution might be too close to the noisy observations to yield a good approximation of $x_0$.

Remark that we can write $R_k = (D^2 + A_k^tA_kI_{m_0})^{-1}D(GG^t)^{-1}G$, as a matrix, with $R_k^t$ its transpose matrix.

3.1. Tikhonov regularization 

Consider the choice $A_k = \sqrt{\alpha_k}\,I_{m_0}$, where $\alpha_k$ is a positive decreasing sequence. In this case, the estimator (7) can be written as

$$\hat x_k = \arg\min_{x \in X_{m_0}} \|\Pi^n_{Y_{m_0}}(y - Tx)\|^2 + \alpha_k\|x\|^2, \qquad (9)$$

where we recognize the usual expression of the Tikhonov estimator. Assume AN holds true. Theorem 2.1 can then be written in the following way, using the approximation properties of the Tikhonov regularization scheme.

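Proposition 3.1. Assume that $\mathcal{K}_n$ is such that $d_{m_0} \ge \sup_{k \in \mathcal{K}_n} \alpha_k^{-1/(2p)}$. Then, by direct calculation of the trace and spectral radius of $R_k$ under SF and SV, we have, for $\nu \le 1$,

$$\frac{\mathrm{Tr}(R_k^tR_k)}{\rho^2(R_k)} = O\left(\alpha_k^{-1/(2p)}\right).$$

Also under IP and SC, by standard approximation results (see for example [11], Chapter 1), the following inequality holds true

$$\mathbb{E}\|\hat x_{\hat k} - x_0\|^2 \le 2\|(I - \Pi_{X_{m_0}})x_0\|^2 + C \inf_{k \in \mathcal{K}_n} \left[\alpha_k^{2\nu} + 2\left(\alpha_k^{1 + 1/(2p)}\,n\right)^{-1}\right] + \frac{\Sigma(d)}{n}. \qquad (10)$$

Hence under IP the above inequality yields, for $\nu \le 1$,

$$\mathbb{E}\|\hat x_{\hat k} - x_0\|^2 \le Cn^{-\frac{4\nu p}{4\nu p + 2p + 1}}. \qquad (11)$$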

In the statistical literature this rate is interpreted with $s = 2\nu p$: the regularity depends on the ill-posedness of the problem. In the ill-posed literature the error is not related to the underlying dimension, so that rates are different. Typically in a Hilbert scale setting, if the true solution $x_0$ belongs to the Hilbert space $H^s$, optimal rates are of order $O(n^{-s/(2s+2p+1)})$, see for example [7]. So, Equation (10) implies that the penalized Tikhonov estimator $\hat x_{\hat k}$, with $\hat k$ selected as in (5), achieves the best rate of convergence over all the possible choices of smoothing sequences $\alpha_k$, $k \in \mathcal{K}_n$. Moreover, if the input data is not too smooth, i.e. for $\nu \le 1$, Equation (11) implies that the penalized estimate achieves the minimax rate of convergence for this problem.
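The Tikhonov estimator discussed in this subsection can be illustrated in a simplified sequence-space model. The sketch below assumes observations $y_j = \lambda_j x_j + \varepsilon_j$ and the least-squares functional $\sum_j (y_j - \lambda_j x_j)^2 + \alpha \sum_j x_j^2$, whose minimizer is available coordinate by coordinate; the numerical values and function names are hypothetical and the projection step onto $Y_{m_0}$ is not modelled.

```python
def tikhonov_estimate(singular_values, y, alpha):
    """Coordinate-wise minimizer lambda_j * y_j / (lambda_j^2 + alpha)."""
    return [lam * yj / (lam * lam + alpha) for lam, yj in zip(singular_values, y)]

def tikhonov_objective(x, singular_values, y, alpha):
    """Data fit plus alpha times the squared norm of x."""
    fit = sum((yj - lam * xj) ** 2 for xj, lam, yj in zip(x, singular_values, y))
    return fit + alpha * sum(xj * xj for xj in x)

# Hypothetical data in the diagonal model.
singular_values = [1.0, 0.5]
y = [2.0, 1.0]
alpha = 0.25
x_hat = tikhonov_estimate(singular_values, y, alpha)
```

A large `alpha` pulls every coordinate towards zero, while a small `alpha` approaches the naive inversion $y_j/\lambda_j$, which amplifies the noise on coordinates with small singular values.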

To overcome the restrictions induced by the use of the Tikhonov estimate, we turn to model selection estimators in the following part.

3.2. Regularization by projection

Consider the projection estimator $\hat x_m$ over $X_m$. That is, $\hat x_m$ is chosen in such a way as to minimize $\|\Pi^n_{Y_m}(y - T\hat x_m)\|_n$. The estimation error can thus be bounded by



\\x -x m \\ < ||(/-n Xm )a;o|| + 



l n L £ l 



(12) 



Since $Y_m$ is a sequence of linear subspaces, $\mathbb{E}\|\Pi^n_{Y_m}\varepsilon\|_n^2 = O(d_m/n)$. So rates of convergence are of order

^/2 



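This rate depends on the ill-posedness of the operator and the approximation properties of $X_m$. Indeed, following [13] and under SC we have, if $\nu \le 1/2$,

$$\|(I - \Pi_{X_m})x_0\| \le \|(I - \Pi^n_{Y_m})T\|^{2\nu}\|w\| = O(d_m^{-2\nu p}),$$

for a certain $p$, with $w$ the element of the source representation $x_0 = (T^*T)^\nu w$. An optimal choice of the dimension (depending on $\nu$) leads to the rate

$$\|\hat x_m - x_0\| = O_P\left(n^{-\frac{2\nu p}{4\nu p + 2p + 1}}\right). \qquad (13)$$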



We aim at using Theorem 2.1 to select the best projection space among a collection of spaces. For this, consider a collection of index sets $m = (j_1, \dots, j_m)$ such that $m \subset m_0$. For each $m$, define formally $A_m = (a_{ij})$ with

$$\forall i \ne j,\ a_{ij} = 0; \qquad \forall i \in m,\ a_{ii} = 0; \qquad \forall i \notin m,\ a_{ii} = \infty.$$

Then we obtain a model selection estimator which can be written as

$$\hat x_{\hat m} = \arg\min_{(m, x) \in \mathcal{M}_n \times X_m} \|A_{m_0}(y - Tx)\|^2 + \mathrm{pen}(m), \qquad (14)$$

where $A_{m_0} = T^+\Pi^n_{Y_{m_0}}$ and $\mathrm{pen}(m)$ is defined as follows. Let $\Lambda_{m,j}$ be the eigenvalues of the matrix $\Pi_{X_m}A_{m_0}(\Pi_{X_m}A_{m_0})^t$ and $\{L_m\}_{m \subset m_0}$ a collection of weights such that



E(d) := 2 £ 



mCm-o 



ME 



jGm ^niJ 



sup 



j'gm ^m,j 



1 



n SUPjem ^ 



</L, 






= u Pje„ 



< OO 



for $d$ as in Lemma 4.3. As above, under SF, $\Sigma(d)$ does not depend on $n$. Then, for some $\theta > 0$, and $r = 2 + \theta$, set

$$\mathrm{pen}(m) = r(1 + L_m)\sigma^2\left[\sum_{j \in m} \Lambda_{m,j} + \sup_{j \in m} \Lambda_{m,j}\right].$$

Proposition 3.2. Under conditions AN, SF, SV, IP and SC, there exists a constant $C$ which depends on $r$, $T$, $c_1$ and $c_2$ such that with high probability

$$\|\hat x_{\hat m} - x_0\|^2 \le 2\|(I - \Pi_{X_{m_0}})x_0\|^2 + C \inf_{m \in \mathcal{M}_n} \left[d_m^{-4\nu p} + 2\left(d_m^{-(1 + 2p)}\,n\right)^{-1}\right] + \frac{\Sigma(d)}{n}. \qquad (15)$$

Hence the penalized model selection estimator is optimal in the sense that it achieves the best rate among the collection of projection estimators. This also leads to optimal rates, as long as there exists a model $m \in \mathcal{M}_n$ with $d_m = n^{1/(4\nu p + 2p + 1)}$, as shown in Equation (13).

We also have the following interpretation for this estimator, which offers an important insight. Note that (14) is equivalent to minimizing

$$\hat x_{\hat m} = \arg\min_{m \subset m_0} \arg\min_{x_m \in X_m} \left\{-2\langle A_{m_0}y, A_{m_0}Tx_m \rangle + \|A_{m_0}Tx_m\|^2\right\} + \mathrm{pen}(m)$$

$$= \arg\min_{m \subset m_0} \arg\min_{x_m \in X_m} \left\{-2\langle \Pi_{X_m}A_{m_0}y, x_m \rangle + \|x_m\|^2\right\} + \mathrm{pen}(m).$$

Let $\{e_j\}_{j \in m}$ be the canonical basis over $X_m$. Define for each $m$,

$$x_{m,j} = \langle A_{m_0}y, e_j \rangle = \langle y, A^t_{m_0}e_j \rangle, \qquad j = 1, \dots, m.$$

Thus, $\hat m$ is selected by minimizing

$$-\sum_{j \in m} x_{m,j}^2 + r\sigma^2(1 + L_m)\left[\sum_{j \in m} \Lambda_{m,j} + \sup_{j \in m} \Lambda_{m,j}\right].$$

We can recognize a hard thresholding scheme, defined in [ ], highlighting the equivalence between model selection and penalized M-estimation with an $\ell^0$ penalty, as is also quoted in [17].
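The selection rule of this subsection can be sketched as follows. The sketch restricts the collection to nested models $\{1, \dots, d\}$ and sets all weights $L_m$ to zero; both are simplifying assumptions, as are the numerical values and the function names. The original procedure considers general subsets $m \subset m_0$.

```python
def criterion(coeffs, eigenvalues, model, sigma2, r=2.5, weight=0.0):
    """-sum_{j in m} x_j^2 + r sigma^2 (1 + L_m) [sum_{j in m} Lambda_j + sup_{j in m} Lambda_j]."""
    if not model:
        return 0.0  # empty model: no fit and no penalty
    fit = -sum(coeffs[j] ** 2 for j in model)
    lam = [eigenvalues[j] for j in model]
    return fit + r * sigma2 * (1.0 + weight) * (sum(lam) + max(lam))

def select_nested_model(coeffs, eigenvalues, sigma2, r=2.5):
    """Return the dimension d minimizing the criterion over the models {0, ..., d-1}."""
    candidates = range(len(coeffs) + 1)
    return min(
        candidates,
        key=lambda d: criterion(coeffs, eigenvalues, list(range(d)), sigma2, r),
    )

# Hypothetical empirical coefficients x_{m,j} and eigenvalues Lambda_{m,j}.
coeffs = [3.0, 2.0, 0.1, 0.05]
eigenvalues = [0.01, 0.02, 0.04, 0.08]
selected = select_nested_model(coeffs, eigenvalues, sigma2=1.0)
```

A coordinate enters the selected model only when its squared coefficient outweighs the increase of the penalty, which is the thresholding behaviour described above.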

4. Appendix 

In this section we give some technical lemmas. The next lemma characterizes 
the supremum of an empirical process by the norm of an orthogonal projection. 

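Lemma 4.1.

$$\sup_{y \in Y_m,\ \|y\|_n = 1} |\langle \varepsilon, y \rangle_n| = \|\Pi^n_{Y_m}\varepsilon\|_n. \qquad (16)$$

Proof. Using the definition of an orthogonal projector, we have

$$\left\|\varepsilon - \frac{1}{\|\Pi^n_{Y_m}\varepsilon\|_n}\Pi^n_{Y_m}\varepsilon\right\|_n^2 = \min_{\{y \in Y_m,\ \|y\|_n = 1\}} \|\varepsilon - y\|_n^2.$$

As a consequence we can write:

$$\|\varepsilon\|_n^2 - 2\left\langle \varepsilon, \frac{1}{\|\Pi^n_{Y_m}\varepsilon\|_n}\Pi^n_{Y_m}\varepsilon \right\rangle_n + \frac{1}{\|\Pi^n_{Y_m}\varepsilon\|_n^2}\|\Pi^n_{Y_m}\varepsilon\|_n^2 = \min_{\{y \in Y_m,\ \|y\|_n = 1\}} \left\{\|\varepsilon\|_n^2 - 2\langle \varepsilon, y \rangle_n + 1\right\},$$

$$2\left\langle \varepsilon - \Pi^n_{Y_m}\varepsilon, \frac{1}{\|\Pi^n_{Y_m}\varepsilon\|_n}\Pi^n_{Y_m}\varepsilon \right\rangle_n + 2\left\langle \Pi^n_{Y_m}\varepsilon, \frac{1}{\|\Pi^n_{Y_m}\varepsilon\|_n}\Pi^n_{Y_m}\varepsilon \right\rangle_n = 2\sup_{\{y \in Y_m,\ \|y\|_n = 1\}} |\langle \varepsilon, y \rangle_n|,$$

$$\|\Pi^n_{Y_m}\varepsilon\|_n = \sup_{\{y \in Y_m,\ \|y\|_n = 1\}} |\langle \varepsilon, y \rangle_n|,$$

which ends the proof. □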
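The identity of Lemma 4.1 can be checked numerically on a toy example. The sketch assumes a design of four points and a two-dimensional space spanned by vectors that are orthonormal for the empirical scalar product; the vectors, the noise values and the variable names are all hypothetical.

```python
import math

def inner_n(u, v):
    """Empirical scalar product (1/n) * sum_i u_i * v_i."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def norm_n(u):
    """Empirical norm associated with inner_n."""
    return math.sqrt(inner_n(u, u))

# Two design vectors orthonormal for the empirical scalar product (n = 4).
b1 = [1.0, 1.0, 1.0, 1.0]
b2 = [1.0, -1.0, 1.0, -1.0]
eps = [0.3, -1.2, 0.8, 0.5]

# Orthogonal projection of eps onto span{b1, b2} in the empirical norm.
c1, c2 = inner_n(eps, b1), inner_n(eps, b2)
projection = [c1 * p + c2 * q for p, q in zip(b1, b2)]
projection_norm = norm_n(projection)

# Scan the unit sphere of span{b1, b2}: y = cos(t) b1 + sin(t) b2.
grid_sup = max(
    abs(inner_n(eps, [math.cos(t) * p + math.sin(t) * q for p, q in zip(b1, b2)]))
    for t in (2 * math.pi * i / 3600 for i in range(3600))
)

# The supremum is attained at the normalized projection.
maximizer = [v / projection_norm for v in projection]
attained = inner_n(eps, maximizer)
```

The grid maximum approaches the norm of the projection from below, and the normalized projection attains it, in agreement with (16).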

The next result is a deviation inequality based on a functional exponential 
inequality (Theorem 7.4) due to [4] 2003. 

Lemma 4.2. Set r](A) = sup| H | =1 YT l= i £*0**«)* for A : 1" -> R k . Let 
v = E it SU P ^k{-) 2 +2MA)/(c-p 1/2 (A t A)). 

TTien, 

/ q(A) ^) ^- V c - 8 

\<t P V*{A*A) a P Vi(AtA) +V X+ J " ' 

Proof. Since the application $u \mapsto A^tu$ is continuous, we have $\eta(A) = \sup_{u \in S} \sum_{i=1}^n \varepsilon_i(A^tu)_i$ for $S$ some countable subset of the unit ball. On the other hand,

$$\sup_{\|u\| = 1} [A^tu]_j \le \sup_{\|u\| = 1} \|A^tu\| \le \rho(A).$$

Thus $\sup_{\|u\| \le 1} |(A^tu)_i/\rho^{1/2}(A^tA)| \le 1$. Also, following the proof of Corollary 5.1 in [ ]

(A'u) 2 

^ SU P 7 77 ^ 

11-^11=1 p(A*A) 

^ su p — TTT-n — 

l| tt ||=i pC^M) 

:= Zj. 

Set Z = Z(ei, ...,£„) = rj(A)/(o-p 1 / 2 (A t A)). Let Ej stand for the conditional 
expectation given e% for i ^= j. Hence, in the proof of Theorem 7.4 in [4] we may 
bound 

\e-\i (A t u) 2 / (A l u) 2 \ q ~ 2 

\Z-EjZl" < LA. sup ^—J± sup max M^L < (\ Sj \/ay Zj . 

<j q \\ u \\=i p(A t A) ||„| 1=1 « Vp(^A)y 

Thus, E|Z - EjZ|« < z,-g!/2. Finally note that 

Tr(A*A) 



£ 



■j 



p(^*a) ■ 
Thus, the proof follows from Theorem 7.4 in [4]. □

As a corollary, we have the following lemma 

Lemma 4.3. • There exists a positive constant $d$ that depends on $r$ such that the following inequality holds

$$P\left(\eta^2(A) \ge \sigma^2[\mathrm{Tr}(A^tA) + \rho(A^tA)]\tfrac{r}{2}(1 + L) + \sigma^2u\right) \le \exp\left\{-\sqrt{d\left(u/\rho(A^tA) + \tfrac{r}{2}L[\mathrm{Tr}(A^tA)/\rho(A^tA) + 1]\right)}\right\}. \qquad (17)$$

• Set $k_1 = d/(\rho(A^tA)\sigma^2)$ and $k_2 = d\,\tfrac{r}{2}L[\mathrm{Tr}(A^tA)/\rho(A^tA) + 1]$. Then, there exists a constant $C_q$, which depends only on $q$, such that,

Efo 2 (A) - a 2 [Tr(A'A) + p(A*A)]r/2(l + L)]% (18) 

^ Onfii ^9 ' "^9 ^ 

holds. 
Proof. As a first step we will bound v. Since EZ < E 1 ' 2 Z 2 , we have 



, 2 
Si 



< (l + u)Ej2zi(^-\ +Tr{A t A)/ P {A t A). 



tm^ 



, 2 



\ 2 



Moreover, following [1], p. 480, for all $q \ge 2$, the following version of Rosenthal's inequality holds:



n / \ 2 

E<?/2 X) Zi ( ~ ) - 2 9/2 Tr(A*A)/p(A'A)l 

j= 1 ^ ' 



£i 



Hence, we have 

v < (1 4- vYTrUPAl/pUPA) + - 

v 

and 

2 2 (1 + ^) 2 Tr(A t A)/ /0 (A t A)E^ I i - ' 

er 4 



V 2 < 2 



Set $0 < \alpha < 1$. Choose $\delta$ and $\beta$ such that if

2 2 4!(5 2 (l + l/a)(l-^) 2 <ci, 
2<5 2 (l + l/a)(^) 2 <c 2 



and $c = \max((1 + \beta)\max(c_1, c_2), (1 + \beta)(1 + \alpha))$, then $r/2 > c$. Let $u > 0$ and, without loss of generality, assume $\sigma = 1$. Thus,

P(rf(A) > (Tr(A*A) + p{A t A))r/2{l + L) + u) 

< P(r] 2 (A) > (Tr(A*A)(l + a) + (1 + l/a)5 2 V 2 p{A t A))(l + (3) 
+[r/2 - c](Tr(A*A) + ($Vp(A'yl)) + r/2L(Tr(A t A) + p(A*A)) + u) 

< P(V 2 (A) > (Tr(A*A)(l + a) + (1 + l/a)<5Vp(^4))(l + /?) 
+r/2L(Tr(A'A) + p{A t A)) + u) 



Set 



/? 



r (Tr^A) 
2L\pjA T Aj + 



p(A*A) 



The last term is equal to 
+ (1 + l/a)v 2 <5 2 ) (!+/?)+ r/2L 



Tr(jl'A) 
p(i4'.4) 



P 



S^y * (t(w (1 + a) + (1 + lla)vH2 ) ( i+ ® + ( i+ i 'w 



Finally, we may bound 



! 4^r > U ; ^4J . + <$V ) + /3) - ■ 1//3K 



< P 
= P 

<P 



^) 

p(i4*i4) 

77(A) 



> E 



p 1 / 2 {A t A) 

v(A) 
' ' p 1 l 2 {A t A) 



5v 



> E , V j A } - + Sv + (1 + 2/S)x" 



p 1 / 2 (A t A) ~ p 1 / 2 (A t A) 
^ A ) > E ,,^j „ + 



pV2(A«A) " pVz^A) 



= e-V^), 

where we have used repeatedly that for any constant c > 0, ca 2 + 1/cb 2 > 2ab 
and set 

g(A) = ((1 + l/P)- l (l + 2/6) 2 )(r/2L[Tr(A t A)/p(A t A) + 1] + u/p(A*A)). 

Set also d=[(l+ l//3)- 1 (l + 2/J) 2 ]- 1 and b(A) = Tr(A t A)/p{A t A). Thus we 
have shown the first part of the lemma. 
Moreover, using the above inequality, 



E[r] 2 (A) - cr 2 (Tr(A*A) + p(A t A))r /2{l + L)}\ 



OO 



: / a 2 1 qu 1-l e -y/dr/2L[b(A)+l]+du/(p(A*A)) du ^ 

la 
Consider the change of variable $w = du/\rho(A^tA) + d\,\tfrac{r}{2}L[b(A) + 1]$, so that
E[n 2 (A) - a^Tr^A) + p(A*A))r/ 2(1 + L)]\ 
f cj P(AA) \ r ( w -dr/2L[b(A) + l]y- l e-^dw. 

V d ) Jdr/2L[b(A) + l] 

The last expression is in turn bounded by 

a p (f A A [ e-^[w q - x + (dr/2L[b(A) + l]) q - l ]dw 

d J Jdr/2L[b(A) + l] 

ending the proof. 



□